@aritrbas
Collaborator

Summary

This adds HTTP-based health check endpoints for the Calico VPP agent, replacing the existing restart-on-timeout behavior with Kubernetes readiness and liveness probes.

Previously, the agent container would restart frequently while waiting for Felix configuration updates. Between restarts, pods appeared Running even when they were not fully initialized, which made it difficult to distinguish initialization delays from actual failures.

Now, we report initialization status through standard Kubernetes probes, keeping the container running during initialization by marking it as Not Ready. This allows Kubernetes to manage pod lifecycle based on health check status.

Changes

1. New Health Package (calico-vpp-agent/health/)

Created a new package with:

  • health.go: HTTP server with three endpoints:
    • /liveness: Basic health status (for liveness probe)
    • /readiness: Initialization status (for readiness probe)
    • /status: Detailed JSON status (for monitoring/debugging)
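
For illustration, the three endpoints could be wired up roughly as follows. This is a minimal sketch and not the code in this PR: the `Status` type, `NewMux`, `writeProbe`, and the `probe` helper are hypothetical names, and returning 503 for a failing probe is an assumption (Kubernetes HTTP probes treat any status from 200 to 399 as success).

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// Status is a hypothetical, simplified stand-in for the agent's health state.
type Status struct {
	mu      sync.RWMutex
	healthy bool
	ready   bool
	message string
}

// Set updates the health state under the write lock.
func (s *Status) Set(healthy, ready bool, msg string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.healthy, s.ready, s.message = healthy, ready, msg
}

// snapshot returns a consistent copy of the current state.
func (s *Status) snapshot() (bool, bool, string) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.healthy, s.ready, s.message
}

// writeProbe answers 200 when ok is true and 503 otherwise.
func writeProbe(w http.ResponseWriter, ok bool, msg string) {
	if !ok {
		http.Error(w, msg, http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, "OK")
}

// NewMux registers /liveness, /readiness and /status on a fresh mux.
func NewMux(s *Status) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/liveness", func(w http.ResponseWriter, _ *http.Request) {
		healthy, _, msg := s.snapshot()
		writeProbe(w, healthy, msg)
	})
	mux.HandleFunc("/readiness", func(w http.ResponseWriter, _ *http.Request) {
		_, ready, msg := s.snapshot()
		writeProbe(w, ready, msg)
	})
	mux.HandleFunc("/status", func(w http.ResponseWriter, _ *http.Request) {
		healthy, ready, msg := s.snapshot()
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(map[string]any{
			"healthy": healthy, "ready": ready, "message": msg,
		})
	})
	return mux
}

// probe issues an in-memory GET and returns the HTTP status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	s := &Status{}
	mux := NewMux(s)
	s.Set(true, false, "initializing")
	fmt.Println(probe(mux, "/liveness"), probe(mux, "/readiness")) // 200 503
	s.Set(true, true, "All components initialized")
	fmt.Println(probe(mux, "/liveness"), probe(mux, "/readiness")) // 200 200
}
```

In a real agent the mux would be served with `http.ListenAndServe` on the configured health check port; the in-memory `probe` helper only keeps the sketch runnable without opening a socket.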

2. Configuration Changes (config/config.go)

Added healthcheck port configuration:

// HealthCheckPort is the port on which the health check HTTP server listens
// Defaults to 9090
HealthCheckPort *uint32 `json:"healthCheckPort"`

The healthcheck port can be customized via ConfigMap:

  CALICOVPP_INITIAL_CONFIG: |-
    {
      "healthCheckPort": 9090
    }
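
As a rough illustration of how an optional port with a default might be resolved, here is a small sketch. Only the `HealthCheckPort` field and its JSON tag come from this PR; `initialConfig`, `resolveHealthCheckPort`, and `defaultHealthCheckPort` are hypothetical names, and the real agent may parse its configuration differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// defaultHealthCheckPort mirrors the documented default of 9090.
const defaultHealthCheckPort uint32 = 9090

// initialConfig is a hypothetical subset of the initial config JSON.
type initialConfig struct {
	HealthCheckPort *uint32 `json:"healthCheckPort"`
}

// resolveHealthCheckPort parses raw JSON and falls back to the default
// when the key is absent (the pointer stays nil in that case).
func resolveHealthCheckPort(raw string) (uint32, error) {
	var cfg initialConfig
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		return 0, fmt.Errorf("parsing initial config: %w", err)
	}
	if cfg.HealthCheckPort == nil {
		return defaultHealthCheckPort, nil
	}
	return *cfg.HealthCheckPort, nil
}

func main() {
	for _, raw := range []string{`{}`, `{"healthCheckPort": 9191}`} {
		port, err := resolveHealthCheckPort(raw)
		fmt.Println(port, err)
	}
}
```

Using a pointer lets the code tell an omitted key apart from an explicit value. Note that `encoding/json` rejects a trailing comma after the last key, so the ConfigMap value has to be strict JSON.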

3. Deployment YAML Changes (yaml/base/calico-vpp-daemonset.yaml)

Added Kubernetes health probes to agent container:

  startupProbe:
    failureThreshold: 10
    httpGet:
      path: /liveness
      port: 9090
      scheme: HTTP
    initialDelaySeconds: 30
    periodSeconds: 30
    timeoutSeconds: 3

  livenessProbe:
    failureThreshold: 3
    httpGet:
      path: /liveness
      port: 9090
      scheme: HTTP
    initialDelaySeconds: 30
    periodSeconds: 10
    timeoutSeconds: 3

  readinessProbe:
    failureThreshold: 3
    httpGet:
      path: /readiness
      port: 9090
      scheme: HTTP
    initialDelaySeconds: 10
    periodSeconds: 5
    timeoutSeconds: 3

Components Tracked

The health system tracks the initialization of these components:

  1. vpp: VPP connection established
  2. vpp-manager: VPP Manager ready
  3. felix: Felix configuration received
  4. agent: Agent fully initialized and running
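
One way to picture the tracking is a small registry in which readiness means "every registered component has reported in". The sketch below is an assumption about the shape of such a tracker, not the package's actual API; `tracker`, `markInitialized`, `pending`, and `ready` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// componentStatus holds the per-component fields shown by /status.
type componentStatus struct {
	Initialized bool
	Message     string
	UpdatedAt   time.Time
}

// tracker is a hypothetical registry of components to wait for.
type tracker struct {
	mu         sync.Mutex
	components map[string]*componentStatus
}

// newTracker registers every name as not yet initialized.
func newTracker(names ...string) *tracker {
	t := &tracker{components: make(map[string]*componentStatus)}
	for _, n := range names {
		t.components[n] = &componentStatus{Message: "not initialized"}
	}
	return t
}

// markInitialized records that a known component finished initializing.
func (t *tracker) markInitialized(name, msg string) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	c, ok := t.components[name]
	if !ok {
		return fmt.Errorf("unknown component %q", name)
	}
	c.Initialized, c.Message, c.UpdatedAt = true, msg, time.Now().UTC()
	return nil
}

// pending returns the sorted names of components still initializing.
func (t *tracker) pending() []string {
	t.mu.Lock()
	defer t.mu.Unlock()
	var out []string
	for name, c := range t.components {
		if !c.Initialized {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

// ready reports whether every registered component is initialized.
func (t *tracker) ready() bool { return len(t.pending()) == 0 }

func main() {
	t := newTracker("vpp", "vpp-manager", "felix", "agent")
	_ = t.markInitialized("vpp", "VPP connection established")
	_ = t.markInitialized("vpp-manager", "VPP Manager ready")
	fmt.Println(t.ready(), t.pending()) // false [agent felix]
}
```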

Monitoring

The /status endpoint provides detailed information about the health check status. Here is an example status response:

{
  "healthy": true,
  "ready": true,
  "components": {
    "agent": {
      "initialized": true,
      "message": "Agent fully initialized and running",
      "updatedAt": "2024-10-15T22:30:00Z"
    },
    "felix": {
      "initialized": true,
      "message": "Felix config received",
      "updatedAt": "2024-10-15T22:29:45Z"
    },
    "vpp": {
      "initialized": true,
      "message": "VPP connection established",
      "updatedAt": "2024-10-15T22:29:30Z"
    },
    "vpp-manager": {
      "initialized": true,
      "message": "VPP Manager ready",
      "updatedAt": "2024-10-15T22:29:35Z"
    }
  },
  "message": "All components initialized",
  "lastUpdate": "2024-10-15T22:30:00Z"
}
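
For monitoring scripts, a response of this shape can be decoded with the standard library. The sketch below assumes only the field names visible in the example above; the struct names, the `pendingComponents` helper, and the sample payload with an uninitialized Felix entry are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"time"
)

// component mirrors one entry under "components" in the example response.
type component struct {
	Initialized bool      `json:"initialized"`
	Message     string    `json:"message"`
	UpdatedAt   time.Time `json:"updatedAt"`
}

// statusResponse mirrors the top-level fields of the example response.
type statusResponse struct {
	Healthy    bool                 `json:"healthy"`
	Ready      bool                 `json:"ready"`
	Components map[string]component `json:"components"`
	Message    string               `json:"message"`
	LastUpdate time.Time            `json:"lastUpdate"`
}

// pendingComponents returns the sorted names of uninitialized components.
func pendingComponents(body []byte) ([]string, error) {
	var st statusResponse
	if err := json.Unmarshal(body, &st); err != nil {
		return nil, fmt.Errorf("decoding status: %w", err)
	}
	var pending []string
	for name, c := range st.Components {
		if !c.Initialized {
			pending = append(pending, name)
		}
	}
	sort.Strings(pending)
	return pending, nil
}

// sample is a made-up payload in which Felix has not reported in yet.
const sample = `{
  "healthy": true, "ready": false,
  "components": {
    "vpp":   {"initialized": true,  "message": "VPP connection established", "updatedAt": "2024-10-15T22:29:30Z"},
    "felix": {"initialized": false, "message": "not initialized", "updatedAt": "2024-10-15T22:29:45Z"}
  },
  "message": "initializing", "lastUpdate": "2024-10-15T22:29:45Z"
}`

func main() {
	pending, err := pendingComponents([]byte(sample))
	fmt.Println(pending, err) // [felix] <nil>
}
```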

Collaborator

@hedibouattour hedibouattour left a comment

Thanks for this change, great !
If I understand correctly this is never gonna timeout and crash ? So if felix doesn't send its config at all we are just stuck at notReady state ?

healthServer.MarkAsUnhealthy("Waiting for Felix configuration")
log.Info("Waiting for Felix configuration...")

ticker := time.NewTicker(20 * time.Second)
Collaborator

I think we can consider reducing the interval; 20s might be too long for retries

@aritrbas
Collaborator Author

aritrbas commented Oct 23, 2025

If I understand correctly this is never gonna timeout and crash ? So if felix doesn't send its config at all we are just stuck at notReady state ?

  • At startup, Kubernetes runs the startupProbe every 30 seconds after an initial delay of 30s. It gives up after 10 consecutive failures, i.e., after roughly (30 + 10×30) = 330 seconds. So, if the probe never succeeds, Kubernetes restarts the container after about 5½ minutes.
  • Once the startupProbe succeeds, the livenessProbe takes over. If it fails 3 times in a row (each 10s apart), Kubernetes restarts the container (≈30s of failure).
  • The readinessProbe controls whether the pod is Ready to receive traffic. If it never succeeds, the pod stays in the NotReady state (i.e., it is not added to service endpoints), but the container is not restarted.

So, if Felix doesn't send its config, the startupProbe causes a restart about every 5½ minutes until the config is received. Once the config is received and the startupProbe succeeds, the container is only restarted if the livenessProbe fails (which can now only happen if the agent crashes or the goroutine managing the health server goes down).
